In silico comparison of Iranian HIV-1 envelope glycoprotein with five nearby countries.

HIV-1 envelope (env) glycoprotein plays an important role in the entry of the virus into susceptible target cells. Because the env glycoprotein of HIV-1 varies widely between geographical regions, the present study compares the properties of this protein in Iran with those in five nearby countries. The sequences of HIV-1 env glycoproteins from Iran, Afghanistan, Russia, Turkey, Pakistan and Saudi Arabia were collected from databases. Amino acid composition and physical and chemical properties of the proteins from these countries were studied using the ProtParam and COPid tools. Receiver operating characteristic (ROC) curve analysis and support vector machine (SVM) were used to evaluate the association between the properties of HIV-1 env glycoprotein of Iran and those of five nearby countries. The results indicate that amino acid composition and four physical and chemical properties (molecular weight, isoelectric point, aliphatic index, and grand average of hydropathicity) of HIV-1 env protein in Iran and Russia were not significantly different. In conclusion, in silico techniques provide valuable information for comparing HIV-1 envelope glycoprotein across geographical locations.


INTRODUCTION
HIV-1 is one of the most important pathogens and causes the majority of HIV infections globally [1,2]. HIV-1 is a member of the Retroviridae and is classified into three groups: group M (main), group O (outlier), and group N (non-M/non-O) [3,4]. The majority of infections are caused by group M, which is divided into nine subtypes designated A, B, C, D, F, G, H, J and K [5,6]. The genome of HIV is composed of two copies of positive-strand RNA, which are packaged in a protein capsid and surrounded by a lipid envelope [7]. HIV-1 env glycoprotein is essential for entry of the virus into the cell because it binds to specific receptors on the surface of target cells [8]. HIV-1 env glycoprotein has proved useful for studying variation among HIV strains by a number of approaches [4]. Some of these approaches are based on amino acid sequences, amino acid composition and physical and chemical properties of proteins. Physical and chemical parameters such as molecular weight, isoelectric point, instability index, aliphatic index and grand average of hydropathicity (GRAVY) are widely used to characterize viral proteins [9]. Several researchers have worked on evolutionary patterns of HIV-1 envelope group M in Africa, Asia, Europe and America [10]. In addition, phylogenetic analysis of HIV-1 env glycoprotein from different Asian countries such as India, Bangladesh, Cambodia, China and Japan has been reported [11]. However, the diversity of HIV-1 envelopes in Iran and some nearby countries has not yet been investigated. In the present study, the amino acid composition and some physical and chemical properties of HIV-1 env proteins of different subtypes in Iran and five nearby countries are studied.

MATERIALS AND METHODS
Data collection: Amino acid sequences of HIV-1 env protein from six different countries (Iran, Pakistan, Russia, Saudi Arabia, Afghanistan and Turkey) were collected from NCBI (http://www.ncbi.nlm.nih.gov/protein). Two hundred sixty-two amino acid sequences from Iran and 752 sequences from five nearby countries (349 Pakistan, 280 Russia, 62 Saudi Arabia, 34 Afghanistan and 27 Turkey) were studied. Redundant sequences (more than 95% similarity) were removed from the dataset by CD-HIT. After running CD-HIT, the numbers of sequences remaining in the dataset for Iran, Pakistan, Russia, Saudi Arabia, Afghanistan and Turkey were 123, 123, 50, 26, 12 and 30, respectively. In the next step, all nucleotide sequences of these six groups were also collected from NCBI.
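The redundancy-removal step can be sketched as a small Python wrapper around the CD-HIT command line. This is only an illustration under stated assumptions: the file names are hypothetical, the exact options used in the study are not reported, and the 0.95 identity threshold is taken from the "more than 95% similarity" criterion above. The options `-i`, `-o`, `-c` and `-n` are standard CD-HIT options (input, output, identity threshold, word size).

```python
import shutil
import subprocess


def build_cdhit_command(fasta_in, fasta_out, identity=0.95):
    """Build a CD-HIT command that clusters sequences at the given identity.

    Word size 5 (-n 5) is the setting CD-HIT recommends for identity
    thresholds between 0.7 and 1.0, so other thresholds are rejected here.
    """
    if not 0.7 <= identity <= 1.0:
        raise ValueError("word size 5 is only suited to identity thresholds of 0.7-1.0")
    return [
        "cd-hit",
        "-i", fasta_in,            # input FASTA file (hypothetical name)
        "-o", fasta_out,           # output FASTA with one representative per cluster
        "-c", f"{identity:.2f}",   # sequence identity threshold
        "-n", "5",                 # word size
    ]


def run_cdhit(fasta_in, fasta_out, identity=0.95):
    """Run CD-HIT if it is installed; raise a clear error otherwise."""
    if shutil.which("cd-hit") is None:
        raise RuntimeError("cd-hit executable not found on PATH")
    subprocess.run(build_cdhit_command(fasta_in, fasta_out, identity), check=True)
```

For example, `run_cdhit("iran_env.fasta", "iran_env_nr95.fasta")` would write one representative sequence per 95% identity cluster, assuming CD-HIT is installed and the input file exists.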

Servers and tools:
Context-based Modeling for Expeditious Typing (COMET): All nucleotide sequences of the HIV-1 env genes from the six countries were analyzed using the Calibrated Population Resistance (CPR) subtyping tool COMET to predict their subtypes. COMET (v. 0.2) is a reliable tool for predicting HIV-1/2 subtypes [12]. The tool is available at http://comet.retrovirology.lu and uses the prediction by partial matching (PPM) compression algorithm.
ProtParam tool: ProtParam is a server which computes different physical and chemical parameters of a protein [13]. The web-server is available at http://web.expasy.org/protparam. Four characteristics (molecular weight, isoelectric point, GRAVY and aliphatic index) of HIV-1 env glycoprotein from Iran and five nearby countries were evaluated by ProtParam.
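Two of the four ProtParam parameters, GRAVY and the aliphatic index, have simple closed-form definitions and can be sketched in a few lines of Python. The code below is a minimal illustration rather than the ProtParam implementation: the function names are hypothetical, the hydropathy values are the Kyte-Doolittle scale, and the aliphatic index uses the usual coefficients (2.9 for Val, 3.9 for Ile and Leu) applied to mole percentages. Molecular weight and isoelectric point are omitted because they need residue masses and pKa tables.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    # A non-standard residue letter raises KeyError rather than being skipped.
    return sum(KYTE_DOOLITTLE[res] for res in seq) / len(seq)


def aliphatic_index(sequence):
    """Aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)).

    X is the mole percentage (100 * count / length) of each residue.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")

    def mole_percent(res):
        return 100.0 * seq.count(res) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))
```

Applying both functions to every sequence of a country yields one value per sequence and property, which is the kind of per-sequence table a later two-class comparison would need.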
Composition-based protein identification: The amino acid sequences of env glycoprotein from the six countries were analyzed using Composition Based Protein Identification (COPid). COPid is a server which analyzes the composition of the various types of amino acids [14]. The COPid web server is available at http://www.imtech.res.in/raghava/COPid. This server helps researchers to elucidate the function of a protein and to generate a phylogenetic tree from its composition.
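Amino acid composition, the quantity analyzed with COPid, is the percentage of each residue type in a sequence. A minimal Python sketch is given below; the function name is hypothetical, and this is not the COPid implementation, only the underlying calculation.

```python
from collections import Counter

STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the percentage of each of the 20 standard amino acids.

    Percentages are relative to the full sequence length, so non-standard
    characters count towards the length but get no entry of their own.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: 100.0 * counts[aa] / len(seq) for aa in STANDARD_AMINO_ACIDS}
```

Each sequence is thereby reduced to a vector of 20 percentages, so the composition of one amino acid can be compared between two countries in the same way as a single physicochemical property.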
Statistical analysis: The data were evaluated by receiver operating characteristic (ROC) curve analysis and Molegro Data Modeller (MDM). A ROC curve is a tool for organizing classifiers and visualizing their performance. ROC curves are commonly used in machine learning and data mining research [15]. The ROC server calculates accuracy (ACC) and thereby measures the association between the properties of HIV-1 env glycoprotein in Iran and in the five nearby countries. ACC measures how well the positive and negative classes of data can be distinguished. An ACC value of more than 0.8 means that the difference between the two classes is significant. ACC is calculated using the following formula: ACC = (Σ True positive + Σ True negative) / Σ Total population. MDM is a cross-platform application for data mining and data visualization. MDM generates regression and classification models by partial least squares, neural networks and support vector machines [16]. SVMs are supervised learning algorithms which are broadly used in classification and regression problems [17].
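The ACC formula can be written out in Python as follows. The first function is the formula itself. The second is a hypothetical illustration of how an ACC value might be obtained for one property between two countries, by scanning every observed value as a cut-off and keeping the best result; the procedure actually used by the ROC server is not described here, so this part is an assumption.

```python
def accuracy(tp, tn, fp, fn):
    """ACC = (true positives + true negatives) / total population."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("empty population")
    return (tp + tn) / total


def best_threshold_accuracy(positives, negatives):
    """Highest ACC reachable with a single cut-off on one property.

    Every observed value is tried as a cut-off; values >= cut-off are
    predicted positive. The mirrored rule (values < cut-off predicted
    positive) has accuracy 1 - ACC, so both directions are considered.
    """
    if not positives or not negatives:
        raise ValueError("both classes need at least one value")
    best = 0.0
    for cut in sorted(set(positives) | set(negatives)):
        tp = sum(v >= cut for v in positives)
        fn = len(positives) - tp
        tn = sum(v < cut for v in negatives)
        fp = len(negatives) - tn
        acc = accuracy(tp, tn, fp, fn)
        best = max(best, acc, 1.0 - acc)
    return best
```

With this sketch, two groups whose values overlap completely give an ACC near 0.5, whereas two fully separated groups give 1.0, which is the sense in which a low ACC indicates that two countries cannot be told apart on that property.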

RESULTS
The results of HIV-1 subtyping in Iran, Russia, Turkey, Saudi Arabia, Pakistan and Afghanistan are summarized in Table 1. Subtype A was the dominant subtype in Iran, Russia and Pakistan, whereas the dominant subtypes in Saudi Arabia, Turkey and Afghanistan were C, B and AD, respectively.
The analyses of the ProtParam results using ROC curve and SVM are presented in Tables 2 and 3. Table 2 shows that the ACC values of GRAVY, aliphatic index, isoelectric point and molecular weight between Iran and Russia were 0.56, 0.65, 0.62 and 0.65, respectively, whereas the ACC values between Iran and the four other countries were higher (Table 2). Because the ACC values from ROC analysis between Iran and Russia were less than 0.8, the physicochemical properties of the viral env protein in these two countries were not significantly different. The results of MDM analysis confirmed these findings (Table 3): the values of the four parameters (molecular weight, isoelectric point, aliphatic index and GRAVY) between Iran and Russia were less than 6.4, and the values between Iran and the four other countries were more than 0.7. Together, the ROC curve and MDM analyses showed that the physicochemical properties of HIV-1 env protein in Iran and Russia were not significantly different. The results of ROC curve analysis for the amino acid composition of the six datasets are shown in Table 4. The ACC values of 17 of the 20 amino acids between Iran and Russia were less than 70%. These amino acids were Asp, Ala, Glu, Phe, Ile, His, Gly, Leu, Asn, Lys, Arg, Pro, Gln, Thr, Val, Trp and Tyr. In contrast, the ACC values between Iran and the four other countries indicated significant differences.
The results of Molegro analysis for the amino acid composition of the six datasets are shown in Table 5. The ACC values of 17 of the 20 amino acids between the Iran and Russia datasets were less than 0.7. The ACC values of Cys, Met and Ser between Iran and Russia were more than 0.7.

DISCUSSION
In the present study, the association between the properties of HIV-1 env glycoprotein in Iran and five nearby countries was studied. According to the literature, most of the variation in HIV-1 is found in the env glycoprotein gp41 and gp120 sequences [18,19]. In a large number of cases, phylogenetic clustering of HIV-1 isolates is based on differences in the nucleotide sequences of env genes [20,21]. HIV-1 env proteins of different subtypes and sub-subtypes can vary in more than 30% of their amino acids [22,23]. In recent decades, several phylogenetic classifications of the HIV-1 env glycoprotein have been proposed for Asian, African, European and American countries. In 2006, Ahn and Son reported that codon usage patterns among the HIV-1 env proteins of different subtypes may be a useful means of predicting the evolutionary patterns of pandemic viruses [10]. In 2007, Singh and Seth analyzed amino acid sequences of HIV env protein with the Clustal X software and found an association between the sequences from different Asian countries [11]. Our results demonstrate that the ACC values for amino acid composition and four physical and chemical properties of HIV-1 env protein between Iran and Russia were less than 0.6. According to the literature, ACC values of less than 0.6 mean that the differences between the positive and negative classes of data are not significant [24]. Therefore, the viral env proteins in Iran and Russia were not significantly different. In contrast, the physicochemical properties differed significantly between Iran and the four other countries (Turkey, Afghanistan, Pakistan and Saudi Arabia). The results also showed that subtype A is dominant in Iran and Russia. Some researchers have reported that subtype A can circulate at a high rate among intravenous drug users and that the most probable route of introduction of HIV-1 subtype A into Iran is through countries of the former Soviet Union [25,26,27]. These observations indicate that the in silico properties of HIV-1 env protein in Iran and Russia are similar.